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Abstract 

Motivation: Detection and quantification of active transcripts 
using RNA-Seq is a central task to transcriptomics research. 
Initial efforts on mathematical or statistical modeling of read 
counts or per-base exonic expression signal have been successful 
but may face an increasing risk of model misspeciflcation and 
overfitting. This is because the number of reference transcripts 
in the database is much larger than that of the active transcripts 
expressed under a single biological condition, and the difference 
is getting larger with the accelerated augmentation of transcripts 
database. To overcome this risk, the reference transcripts that 
are not supported by the exonic expression signal may be pe- 
nalized using shrinkage approaches. The standard shrinkage ap- 
proaches, such as Lasso, shrink all the transcript abundances to 
zero in a blind way, thus, it does not necessarily lead to the set 
of active transcripts. Informed shrinkage approaches, motivated 
by the observed exonic expression signal, are thus desirable. 
Results: We propose a new mathematical model of the observed 
exonic expression signal and the underlying transcript structure, 
we introduce a tuning parameter to penalize the selected regions 
in the selected transcripts that were not supported by the uncov- 
ered exonic regions, and we develop a constrained least square 
algorithm to adapt ively adjust the shrinkage level based on the 
exonic expression signal. We implement and integrate the new 
method into our existing GUI system, SAMMate, to detect and 
quantify active transcripts. Our tool takes a variety of RNA-Seq 
data formats, such as SAM or BAM, as input and output tran- 
script abundance through a few mouse clicks. Using simulation 
studies, our methods compare favorably with selected competing 
methods in terms of both time complexity and accuracy. We also 
demonstrate the potential applications by analyzing a real- world 
RNA-Seq data set. 

Availability: Both simulation data used for method com- 
parisons as well as the GUI tool are freely available at 
http : / / asammate . sourcef orge . net/ . 
Contact: dzhu@wayne.edu 



1 Introduction 

Detection and quantification of active transcripts using gene ex- 
pression data is essential to a wide range of transcriptomics re- 
search. The problem is non-trivial due to the fact that the ob- 
served exonic expression signal can be aggregated from a set 
of active transcripts encoded by the same gene with diverse 
alternative splicing mechanisms. Moreover, the excessive se- 
quencing errors and bia s existing in the dat a make the problem 
even more challenging ([Roberts et all 1 20 111 ; Ijones et all l2012h . 
In the earlier studies, several computational approaches have 



been developed to utilize high throughput gene expression pro- 
filing data collected from microarray experiments. For exam- 
ple, an Expectation-Maximization (EM) type of algorithm us- 
ing Expressed Sequence Tags (ESTs) (Xin g et all [2006) and 
a Nonnegative Ma trix Decomposition (NMF) based algorithm 
(Ant on et a/.| [2008) using exon and exon-exon junction microar- 
rays. Both approaches represent the pioneering efforts to tackle 
the isoform transcript quantification problem, however, their 
performances were limited by data quantity (ESTs) and qual- 
ity (microarrays). 

RNA-Seq technology substantially improved both data quan- 
tity and quality for a better detection and quantification 
of active transcripts. The existing approaches are either 
based on statistical modeli ng of read counts, such as Pois- 
son and Negat i ve Bi n omial (Ijiang and W ong 2009; W ang all 
l201Cb iLi etall l2010l; rTrapnelTe t al 2010; Nicol ae et all 1201 ll : 
iDeng et all 1201 ll ; iHu et all [20121 : iDu et all 120121 ) or based on 
mathematical modeling of the relationship between per-base 
exonic coverage signal and isoform transcript structures, such 
as rQ ua nt and IsoLasso dBohnert and Ratschl l2010l ; ILi et all 
1201 lal fbl; iNguven et a/.l l201ll ) . Despite the initial success in de- 
tection and quantification of active transcripts, significant chal- 
lenges remain in model misspecification and the resulted over- 
fitting. For example, as of June 2012, there are around 130, 000 
reference transcripts corresponding to some 20, 000 annotated 
human genes in the Ensembl database. 51% of these genes have 
5 or more annotated transcripts (Figure SI). However, only one 
major isoform transcript and one or two minor isoform tran- 
scripts are typically active under a single biological condition, 
making the total number of active transcripts per gene not ex- 
ceeding three. Thus model misspecification is inevitable for a 
vast majority of genes and appropriate model selection is ur- 
gently needed. 

The existing shrinkage approach detects active transcripts 
(without quantification) by shrinking the abundance proportion 
parameters associated with each isofo rm transcripts t oward zero 
using the standard Lasso procedure (|Li et a/.ll2011al ). The ap- 
proach strives to minimize the number of active transcripts while 
simultaneously to minimize the least-square difference between 
the observed the per-base exonic coverage signal and the ex- 
pected one from the model. It represents one of the first efforts 
to address the model misspecification and overfitting issues in 
detecting and qualifying the active transcripts. Since thousands 
to tens of thousands per-base exonic signal are often sufficient to 
estimate just a few isoform transcript proportion parameters, we 
argue that a straightforward application of Lasso shrinkage ap- 
proach can be less effective for detecting the active transcripts. 
In addition, the follow-up quantification of the active transcripts 
is also frequently demanded in transcriptomics research but not 
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Figure 1: Transcript abundance quantification using the observed per-base exonic expression signal (single sample case), r represents the gene 
expression abundance parameter, p represents abundance proportions of the three isoform transcripts, and e represents the observed per-base 
exonic expression signal. In this example, the gene has three reference transcripts and four annotated exons. The second transcript skips the 
second exon whereas the third transcript skips the third exon. (a) The model without shrinkage. Each scalar a corresponds to the expression 
signal of the i th exonic position and is expected to be the sum of expression of the transcripts covering that position, (b) The model with 
selective and adaptive shrinkage for model selection. A tuning parameter a is introduced to penalize the selected regions (second) of the selected 
transcripts (the first and the third) having no exonic expression signal. The shrinkage level is adaptively adjusted according to the exonic 
expression level over the optimization iterations. 



attainable by the standard Lasso approach. 

Here we present a new selective and adaptive shrinkage ap- 
proach to address the model misspecification and overfitting is- 
sues. Instead of shrinking all the isoform transcript abundance 
parameters, we do it only to the selected transcripts and to the 
selected regions on those transcripts that are not supported by 
the uncovered exonic regions. In Figure [T] the left panel illus- 
trates our model of per-base expression signal and transcript 
structures without shrinkage. The right panel introduces a tun- 
ing parameter a to the model to penalize the regions in the 
selected isoform transcripts that do not give exonic expression 
signal (e.g. the 2 nd exonic region). The tuning parameter is 
couched in an optimization algorithm so that the shrinkage level 
can be adaptively adjusted over iterations according to the ex- 
onic expression level. It is also seen from Figure [1] that another 
product of our model is a new metric = rp for transcript ex- 
pression. It provides an alternative to the existing transcrip t 
quantifi cation metrics, such as RPKM (Mortazavi et al 2008), 
FPKM (jTrapnell et q/j|2010h and r (|Li and D ewev 201 lh . 



2 Methods 

We develop two algorithms to detect and quantify active tran- 
scripts: one-step SASeq and iterative SASeq. One-step SASeq 
assumes known gene-level expression abundance, while iterative 
SASeq does not. Thus, the latter simultaneously quantifies both 
gene- level and isoform- level gene expression. 



2.1 One-step SASeq 

We first introduce the model in the case of single RNA-Seq sam- 
ple. We then extend it to accommodate biological and technical 
replicates. 



2.1.1 Gene structure model with selective and adap- 
tive shrinkage for a single sample 

We use per-base exonic gene expression signal, a vector for a 
single RNA-Seq sample obtained by aligning reads to a refer- 
ence genome, as inputs. For each gene, we denote the num- 
ber of bases in exonic regions by M, the number of annotated 
transcripts by N and the gene- level expression abundance by r. 
Then, the vector of observed per-base expression signal is given 
by e = [ei, e2, ejiif, where a is the observed read coverage at 
the i th exonic position. We also denote the vector of transcript 
proportions by p = [pi,P2, ■■-,Pn] T , where pj is the abundance 
proportion of j th transcript, ^2f =1 Pj = 1 and < p 3 < 1, for all 
j. The isoform transcript structures are represented as a splicing 
matrix, S G {0, l} MxiV , of 0's and l's. Each row i represents a 
single base and each column j represents an isoform transcript. 
Sij — 1 indicates that the j th isoform transcript contributes to 
the exonic signal observed at i th base, SV,=0 otherwise. 

Figure [TJl illustrates an example gene model without shrink- 
age. In this example, the 1 st transcript is full-length. Thus all 
the elements of the 1 st column of S take the value of 1 (Sji = 1 
for all j e [1...M]). In the 2 nd column of S, the elements cor- 
responding to the 2 nd exon take the value of because the 2 nd 
transcript skips the 2 nd exon. In the 3 rd column of S, the ele- 
ments corresponding to the 3 rd exon take the value of because 
the 3 rd transcript skips the 3 rd exon. 

At each exonic position i, the expected per-base exonic ex- 
pression signal is r^2^ =1 SijPj whereas the observed exonic ex- 
pression signal is a. The latter is accumulated from both exonic 
and junction read coverage. In addition to exonic reads, junction 
reads are indispensable for detecting and quantifying adjacent 
exons as they cover the exon-exon junction regions. They are 
also essential for detecting short exons whose expression signal 
may be only detectable by junction reads. Moreover, as the read 
length increases, a read is more likely to span multiple exons 
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hence carries valuable coverage information for our calculation. 
We write the observed per-base expression signal as a sum of the 
model explained portion and error as the following: 
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e = rSp + e. (1) 

For each gene locus, the relationship between the observed 
per-base exonic signal and the latent transcript proportions can 
be mathematically modeled as in equation (pQ), where e is the 
error term. In other words, we solve the following constrained 
linear least square problem: 
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< p < 1, 



(2) 



where 1 T = [1, 1, 1, ...1]. 

N 

Our goal is to minimize e while avoiding overfltting. Biologi- 
cally, only a subset of transcripts are active under a single con- 
dition; this is translated into a possible model misspecification 
that attempts to use the entire set of transcripts to explain the 
observed the exonic signal and to minimize error. For example, 
in Figure [TJd, it is very likely that the reads are originated from 
the second transcript. Otherwise, there should be exonic signal 
from the (blue) regions shared by the first and the third tran- 
scripts. Solving the constrained least square problem without 
shrinkage (Figure [TJi) may inflate proportions for the first and 
third transcripts just to overfit the gene model. On the other 
hand, the blind shrinkage approach shrinks all the transcripts so 
that the active transcripts can be erroneously removed from the 
model. 

Figure [2] shows that the vast majority of the genes (70.79%) 
have significant uncovered exonic regions even when they are 
highly expressed (> 10 in exonic coverage depth). We can penal- 
ize those isoforms with regions that overlap with the uncovered 
exonic regions to further improve the accuracy of our prediction. 
We develop a selective and adaptive shrinkage approach through 
modifying the splicing matrix according to the observed exonic 
signal. Denoting a as a tuning parameter, we propose a new se- 
lective and adaptive shrinkage model with the modified structure 
matrix S as the following: 
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Figure 2: Significant uncovered exonic regions exist in a vast major- 
ity of expressed genes. Each dot represents an expressed gene. The 
horizontal axis represents the percentage of uncovered bases. The ver- 
tical axis represents the log base 10 of gene expression abundance. 
The data set contains 200 million reads of 100 bases long. 8, 058 out 
of 11,383 expressed genes (70.79%) have gene expression abundance 
values greater than 10 and uncovered exonic regions greater than 10%. 



The above formulation may be sensitive to noise because it 
penalizes the exonic regions with exactly zero coverage. To ac- 
commodate the noise, we consider the average coverage of each 
exonic regions (Figure S2). We first calculate the average cov- 
erage signal over each region and then penalize those regions 
having very low average coverage compared to the overall cov- 
erage of the whole gene. Define a = [a±, <22, clm] T as a new 
vector, each scalar value a% is the average coverage of the region 
that the i th base belongs to. Formally, we use Equation Q to 
accommodate noise as follows: 



Sij — 



1 + a if Sij = l,di < 1, en < a, 
Sij otherwise, 



(4) 



for all i £ [1 . . . M] and j £ [1 ... N] , where a is the new signal- 
noise cutoff parameter with a default value of r/100. 

We rewrite the objective function as (e — rSp) T (e — rSp) = 



p i (r 2 S i S)p-2p i (rS i e; 



+ e T e. 



After dropping the constant 



term e 1 e and canceling out r from the objective function, J2} 
takes the form of a convex Quadratic Programming (QP): 



1 + a if Sij = 
Sij otherwise, 



1, ei 



0, 



(3) 



for alH £ [1 . . . M] and j £ [1 ... N], where a = r. The under- 
lying rationale is that the stronger the overall exonic signal r is, 
the more shrinkage applies to the regions of the transcripts that 
do not yield exonic signal. Moreover, the shrinkage level adap- 
tively depends on the overall gene expression abundance over the 
optimization iterations. 



min/(p) = ip T (rS T S)p-p T (S T e), 
p 2 

subject to l T p = 1, 
< p < 1, 



(5) 



where (rS T S) is symmetrical and positive semidefinite. Please 
refer to Appendix for technical details on quadratic program- 
ming. 
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2.1.2 Gene structure model with selective and adap- 
tive shrinkage for multiples samples 

In case of multiple samples, the same logic applies to es- 
timate the transcript proportions. For each gene locus, 
each sample (replicate) has a different expression abundances 
but sharing the same transcript proportion vector. Let L 
be th e number of the sam p les, a fter proper normalization, 
e.g. iRobinson and Oshlackl (|2010T ). we have: 1) L vectors 
of observed read coverage ei = [en, ei 2 , ■ ■ ■ , eiM] T , e 2 — 
[e 2 i, e 22 , . . . , e 2M ] T , • • - [e L i, e L 2, • • • , zlmY \ 2 ) L modi- 
fied splicing matrices Si, S 2 , . . . , Sl; 3) L gene expression abun- 
dances n, r 2 , . . . , t\l. We also have L equations: 



In the first step of the (k + l) th iteration, the splicing ma- 
trices are modified using the gene-level expression abundances 
r[ k \ • • • and as follows: 



1 + r[ k) if Sij =l,a,i< 1, a t < r^/100, 



(fc) , 



Sij otherwise, 



(9) 



for all I e [1...L], i G [1...M], and j e [1 . . . N]. De- 



noting V( fe+1 ) = [(rf ) S^ +1) ) T ,...,(ri fc) si fe+1) ) T ] T , Q( fc+1 ) = 
(Y( fc+1 )) T V (fc+1) , we recalculate the proportion vector p( fe+1 ) 
by solving the following convex QP: 
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where e is the error term that we want to minimize. The equa- 
tion above can be rewritten as: 
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Denoting y 



[ei , e 2 , . 



and V 

? 2 , • • • , tlSJ] t , the transcript proportions can be 
estimated by solving the following constrained linear least 
square problem: 



where y = [ef , e^, . . . , eJ j ] T as before. 

In the second step, given the new proportion vector p( /c+1 ) 
from we have e z = nS z (/c+1) p (/c+1) + e z for all / e [1 ... L], 

in which n is the subject to be optimized. To calculate r[ k+1 \ 
we solve a new optimization problem: 
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The algorithm iterates between the steps described in (|9]) , (|10p 
and dTTJ until the proportion vector and expression abundance 
values converge. 

3 Results 
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subject to 1 p = 1, 
< p < 1. 
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In fact, the optimization problem described in (0 is the gen- 
eralized form of (|2} since it formulates the optimization problem 
for one or multiple samples. This optimization problem can be 
transformed to a convex QP as follows: 



(8) 



min/(p) = ip T (V T V)p - p T (V T y), 
p 2 

subject to l T p = 1, 
< p < 1, 

where (V T V) is symmetrical and positive semidefinite. 



2.2 Iterative SASeq 

We iteratively estimate both gene-level expression abundances 
r^i\r^\" - and and transcript proportions p. For each 
sample I th , the expression abundance can be initialized by 
using per-base signal from the common regions shared by all 
isoform transcripts. Each iteration consists of two steps: the 
first step recalculates the transcript proportions while the second 
step updates the expression abundances. 



3.1 Simulation studies 

Using simulation studies, we demonstrate the ac curacy and 
speed of SASeq by comparing with RAEM (|Deng et all 
l201lL also impleme nted in the a SAMMate suite), R SEM 
(|Li and Dewevlboilh and Cufflinks (jTrapnell et aZj|2010h . We 
used FluxSimulator http://flux.sammeth.net/index.html, 
to simulate the whole transcriptome sequencing experiments. 
FluxSimulator takes reference transcript sequences as the input, 
randomly generates copy numbers for each transcript, fragments 
them, selects the ones of right sizes to sequence in silico, and fi- 
nally outputs the sequencing reads. Computational approaches 
for isoform detection and quantification can thus be compared 
with regard to the ground truth of isoform transcripts and their 
copy numbers. We generated 15 million, 30 million, 50 million, 
100 million and 200 million single-end reads with lengths of 50 
and 100 respectively from around 130, 000 reference transcripts 
available in Ensembl database (version 65), with various copy 
numbers (Figures S3, S4). The comparison results for the 50- 
mer reads are illustrated in Figures [3l [4] and S8, and the results 
for the 100-mer reads are presented in Figures S5, S6, S7 and 
S8. 

Figure [3] shows the time complexity of the different algo- 
rithms across an increasing number of reads on the same desk- 
top workstation (iMac, Intel Xeon DualCore 2.66GHz, 16GB 
RAM) except for RSEM. We run RSEM separately on a server 
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Runtime Comparison (50 Bases) 
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Figure 3: Runtime comparison of the competing algorithms. Horizon- 
tal axis represents the number of reads. From left to right corresponds 
to the older Illumina GAII (10 — 20 million reads per lane) to newer 
HiSeq (100 — 200 million reads per lane) platforms. The runtime of 
RSEM is at the magnitude of hours while all others are at the magni- 
tude of minutes as shown in vertical axis. 



(4 x Twelve-Core AMD Opteron 2.6GHz, 256GB RAM) due 
to its excessive computational demand. It is clear from Figure 
[3] that the runtime of single-t hread SASeq is qu ite comparable 
with the single-t hread RAEM (bene et aZ.ll201lh and the multi- 
thread Cufflinks (iTrapnell et aZ.l2010h, and it is much faster than 
the multi-thread RSEM ((lT and Dewev 201lh , especially for hun- 
dreds of millions reads generated from the newer Illumina HiSeq 
platform. 

We proceed to compare the accuracy of isoform transcript 
quantification. From the ground truth where we simulate RNA- 
Seq experiments using FluxSimulator, we know the true copy 
numbers of all the isoform transcripts. Thus the most accurate 
isoform transcript quantification algorithm will give a vector of 
isoform proportions that is least divergent from the vector of 
ground truth abundance. We used Jensen-Shannon (JS) diver- 
gence to capture both linear and nonlinear relationships and val- 
ues closer to indicate a better performance. 

In Figure [4ji, the horizontal axis represents the five competing 
methods with each tested using an increasing number of reads, 
corresponding to the real-world Illumina sequencers from the 
older GAII to the newer HiSeq. The vertical axis of Figure [4^ 
represents distributions of JS divergence between the predicted 
transcript abundance and the ground truth. Lower values indi- 
cate a better performance. In Figure 2}), we also plot the me- 
dian divergences of each algorithm across an increasing number 
of reads and observed a similar trend. It is clear from Figure 
|4] that SASeq, both one-step and iterative, outperforms their 
competitors in a vast majority of test cases. More strikingly, 
the superior performance of SASeq gets more pronounced with 
an increasing number of the reads. This is translated into a 
prospectively more powerful algorithm for the forthcoming ultra- 
high throughput sequencing data. In summary, from Figure [3] 
and Figure 21 SASeq demonstrates an impressive speed without 
compromise of accuracy. 

3.2 Real- world data analysis 

We also demonstrate the practical utilities of SASeq using a 
real- world RNA-Seq data set of six Mutu I wild- type cell line 
samples and six mir-155 expressing cell line samples (acces- 
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Figure 4: Comparison of transcript quantification accuracy using JS 
divergence. The upper panel (a) compares algorithms in terms of 
their distributions of divergences from the ground truth. The lower 
panel (b) compares algorithms in terms of their median divergences 
from the ground truth. Both one-step and iterative SASeq algorithms 
outperform their competitors. The performance contrasts get sharper 
with the increasing number of reads. 



sion number: SRA011001). The wild- type Mutu I cell line is 
the type I latency (limited viral gene expression) B-cell line 
whereas the mir-155 expression cell line is introduced by in- 
fecting the Mutu I cell line, in duplicate with an mir-155 ex- 
pressing retrovirus (or an empty vector control retrovirus) , to 
achieve high mir-155 expression (|Xu et a/.ll2010h . The goal of 
the data analysis is to detect the down-regulated transcripts 
that can be potential mir-155 targets. The RNA-Seq data set 
has been initially analyzed and a large number of 3'-UTR assays 
have been done to validate the predicted mir-155 ta rgets (down- 
regulated genes) (jXu et all [2010l ; iDeng et aZ.I l201lh . We first 
performed a per-base sequence quality check using fastQC Soft- 
ware (http: / / www.bioinformatics. bbsrc.ac.uk/ pro jects /fastqc/) . 
We then used TopHat (vl.4.1) (jTrapnell et al\ l2QQ9h to align 
short reads that are unique to the human reference genome 
(hgl9/GRCh37). Default settings and g 1 parameter were used. 
Alignment results were saved in a BAM format. We re- analyzed 
the data set using SASeq to re-discover the differentially ex- 
pressed transcripts that were reported before, and to predict new 
ones that were not. We used SAMMate 2.7.1 (implementation of 
SASeq algorithms) to calculate the gene-level and isoform-level 
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Figure 5: The number of predicted mir-155 targets between the four 
methods, (a) The isoform-level fold change distributio n of 124 t argets 
validated by 3'-UTR assay from the previous study (|Xu et al. 2010). 
(b) The comparison of target prediction using bitmap. Numbers at 
the top show the total predicted number of down-regulated genes of 
each method. Numbers to the left show the number of genes common 
in each respective row. For example, the first row indicates that 49 
genes are predicted by all of the methods whereas the second to last 
row indicates that 9 genes are uniquely predicted by SASeq. 



gene expression abundances. 

We compared the performance of Cufflinks, RAEM, RSEM, 
and SASeq on their capabilities of accurately detecting down- 
regulated isoform transcripts. We used a "gold standard" data 
set of mir-15 5 targets (genuin e down-regulation) validated by 3'- 
UTR assays (|Xu et a/.ll2010l ). Since this RNA-Seq data set has 
very low coverage, we pooled the short reads from the six sam- 
ples within each condition to gain more exonic coverage. Fur- 
thermore, we removed 25% of the transcripts that were least 
expressed in the control case from each method to make the fold 
change prediction more reliable. Since SASeq is not normal- 
ized by default as opposed to other methods, we also normal- 
ize its expression abundance using quant ile normalization. We 
then conducted isoform-level differential expression analysis to 
screen down-regulated isoform transcripts of the 124 mir-155 tar - 
gets that were validated in the previous study (|Xu et a/.ll2010l ). 
As shown in Figure [5Jl, the fold changes predicted by SASeq 
are more significant than other methods. Figure [5)3 shows that 
SASeq is able to significantly predict more down-regulated genes 
than other competing methods by using a cutoff as stringent as 
0.2. 

Particularly, we demonstrate two showcase examples in which 
SASseq detects down-regulated mir-155 targets at the isoform- 
level with little gene-level expression change. In Figure[6j the up- 
per panel shows the exonic expression signal for the gene TAF5L 
whereas the lower panel shows the structures of five reference iso- 
form transcripts and the associated data analysis results. A fold 
change of 0.83 as well as a visual inspection of the upper panel 
show no gene-level differential expression. Furthermore, for this 
gene the target isoform transcript ENST00000366675 has been 
validated by both RT-PCR an d 3'-UTR assay w ith fold changes 
of 0.38 and 0.30, respectively (jPeng et a/.ll201lh . Using SASeq 
we predicted the most accurate fold change of (0.44) compared 
to Cufflinks' (0.58), RAEM's (0.57), and RSEM's (0.52). Sim- 
ilarly, another showcase example PHF17 is presented in Figure 
S9. 



Wild-type Mutu I cell line (Control) 



j| k^jf 



miR-155 over expressing Mutu I cell line (Case) 



jini_ ... .hui^_iuii 



ENST00000366676 Fold Change = 0,04, Proportion : 14%, 1% 



-H 



ENST00000258281 Up-Regulated 



-4- 



ENST00000366675 Fold Change = 0,44, Proportion : 34%, 18% 



ENST00000477957 Fold Change = 0, Proportion : 37%, 0% 



Fold Change at Gene Level : 0.83 
RT-PCR: 0.38 
3'-UTR Assay : 0.30 



ENST00000366674 Not Expressed 



Gene : TAF5L 



Figure 6: An showcase example of gene TAF5L. SASeq predicts both 
ENST00000366675 and ENST00000366676 as down-regulated in mir- 
155 expressing cell line with no gene-level expression change. The 
former has a mir-155 seed in 3'-UTR, therefore it is predicted as a 
target. The prediction was validated by both RT-PCR and 3'-UTR 
assay (Deng et al. 201]]). 



4 Discussion 

In this paper, we presented a new selective and adaptive shrink- 
age approach to detect and quantify active transcripts using 
RNA-Seq. The ever- increasing numbers of reference transcripts 
in the database as well as their error rates aggravate the risk of 
model misspecification and overfitting. Therefore, model selec- 
tion via shrinkage opens a promising avenue to future research 
in this area. 

The key innovation of our approach is to shrink down the tran- 
script abundance proportions that were not supported by the ob- 
served per-base exonic expression signal and adapt ively adjust 
the shrinkage level accordingly. Our approach is an informed 
shrinkage approach, which is fundamentally diffe rent from the 
blind shrinkage approach, such as Lasso (Tibshirani 1994J) , in 
the following: (1). SASeq imposes both non-negative and sum- 
to-one constraints stipulated by the transcript detection and 
quantification problem. The standard Lasso shrinkage only con- 
strains the sum of absolute values of the coefficients to be no 
greater than a cutoff, which likely gives rise to negative abun- 
dance proportions for some transcripts. (2). Choosing the value 
of tuning parameter for the standard Lasso shrinkage approach 
is not straightforward because the number of active transcripts 
is unknown and this number varies from gene to gene and from 
condition to condition. On the other hand, SASeq automatically 
determines the value of the tunning parameter according to the 
overall exonic coverage signal of the gene. (3). For any gene, the 
number of active transcripts under a specific biological condition 
is usually very small compared to the number of reference tran- 
scripts available from databases. The Lasso approach blindly 
shrinks all the reference transcript abundance parameters in a 
non-discriminative way, albeit at different levels. It does not 
necessarily lead to the set of active transcripts. On the other 
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hand, SASeq only penalizes the selected regions of the selected 
transcripts that are not supported by the uncovered exonic ex- 
pression signal. Therefore, the remaining set of transcripts is 
more likely to be active under a specific biological condition. 

SASeq is prospectively more powerful in the near future with 
the accelerated augmentation of transcript database and increas- 
ing sequencing depth. The former will give a more complete set 
of transcripts to select from. The latter will permit a more accu- 
rate selection and shrinkage of the transcripts and their regions. 
In addition to working with transcripts database, SASeq is flex- 
ible enough to detect and quantify active t ranscripts from tran- 



scriptome assembly outputs in gtf format (iTrapnell et al 



Grabherr et al 201l|; iRobertson et al. 1 I2010U Schulz et al. 



2010 



2012 



Zhao et al I I2Q1 ih . where the model misspeciflcation is also an 



outstanding issue due to the excessive assembly bias and errors. 
Appendix 

Convex quadratic programming algorithm 

We use the active- set method (jNocedal and Wrightl l2006h to 
solve the convex QP described above. Our convex QP can be 
written as: 



min/(p) 
p 



1 

oP Qp 



T, 

P d, 



subject to 1 p = 1, 
< p < 1, 



(12) 



where Q is symmetric and positive semidefinite N x N matrix, 
d and p are vectors with N elements. This convex QP has one 
equality constraint and 2 N inequality constraints. 



for 



Convex 



Algorithm Active-Set Method 
QP 

Set p(°) = [1, 0, . . . , 0] T as the feasible starting point; 
Set to be a subset of the active constraints at p^ ; 

Set k = 0; 
loop 

Find Ap (fc) to minimize + Ap (/c) ) in the subspace 

defined by W (/c) ; 

if (Ap (/c) == 0) then 

Compute Lagrange multipliers of inequality-constraints 

included in W (fc) ; 

if (exists a negative Lagrange multiplier) then 

Obtain W^ k+1 ^ by dropping the corresponding con- 
straint from W (/c) ; 
Set p( fc+1 )=p^; 

else 

return p^; 

end if 
else 

if (p^ + Ap( fc ) is feasible with respect to all constraints) 
then 

Set p( fc+1 W fc )+ApW; 
Set W (fc+1) W (/c) 
else 

Set p( /c+1 ) = pW + ct (fc) Ap (fc) where a {k) is chosen to 
be the largest value for which all the constraints are 
satisfied; 

Obtain W^ k+1 ^ by adding the blocking constraints to 



end if 
end if 

Set k = k + 1; 
end loop 

For a given iteration p^ and a working set W^ k \ if the 
objective function is not minimized in the subspace defined 
by the working set, we compute the step Ap^ by solving an 
equality-constrained QP, in which the constraints correspond- 
ing to the working set are treated as equalities. We 
set p( fc+1 ) = p^ + Ap (/c) if it is feasible to all constraints. 
Otherwise, we set p (/c+1) = p (/c) + a (/c) Ap (/c) where the 
step-length chosen to be the largest value for which 
all the constraints are satisfied. The blocking constraint is 
also added to the working set. If the objective function is 
minimized (Ap^=0) in the subspace defined by but one 
of the Lagrange multipliers corresponding to the an inequality 
constraint in the working set is negative (does not satisfy the 
Karush-Kuhn- Tucker condition), we remove this constraint from 
the working set. Upon reaching a Karush-Kuhn- Tucker point 
that minimizes the objective function over its current working 
set, the algorithm terminates. 
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